Abstract

We analyze the performance of several copying garbage collection algorithms in a large
address space offered by modern architectures. In particular, we describe the design and
implementation of the RealOF garbage collector, an algorithm explicitly designed to exploit the
features of 64-bit environments. This collector maintains a correspondence between object
age and object placement in the address space of the heap. It allocates and copies objects
within designated regions of memory called zones and performs garbage collection
incrementally by collecting one or more ranges of memory called windows. The windows are managed
so as to collect middle-aged objects, rather than almost always collecting young objects, as
with a generational collector. The address-ordered heap allows us to use the same inexpensive
write barrier that works for generational collectors. We show that for server applications this
algorithm improves throughput and reduces heap size requirements over the best-throughput
generational copying algorithms such as the Appel-style generational collector.


1  Introduction 

Server-side 64-bit computing today is characterized by very large physical memory support, very
large application virtual address spaces, and 64-bit integer computation using 64-bit
general-purpose registers. In such systems, an application's virtual address space is measured in terabytes
and an increasing number of programs can exploit this opportunity. Database servers use a large
address space for scalability, maintaining buffer pools, caches, and sort heaps in memory to reduce
the volume of I/O they perform. Simulation and other computationally intensive programs benefit
from keeping much larger arrays of data entirely in memory. Finally, another large group of
programs, application servers, has been deployed on 64-bit platforms for some time now.

Report documentation (Standard Form 298): report date 01 APR 2005; title: Comparison of Garbage
Collectors Operating in a Large Address Space; performing organization: University of New Mexico,
Computer Science Department, Albuquerque, NM 87131; approved for public release, distribution
unlimited; 17 pages.

Some of these applications rely heavily on Java technology, and this has forced leading
companies like IBM, Sun, and BEA to introduce 64-bit versions of their Java Virtual Machines.¹ As
a  result,  64-bit  computing  introduces  a  new  set  of  research  opportunities  in  the  field  of  virtual 
machines  related  both  to  evaluating  previously  existing  32-bit  solutions  in  the  64-bit  world  and 
inventing  brand-new  approaches  that  specifically  exploit  the  benefits  of  64-bit  architectures. 

Server-side  applications  tend  to  have  a  very  high  heap  object  allocation  rate.  When  the  heap 
is  full,  garbage  collection  must  free  some  space  in  it  to  allow  the  application  to  continue  running. 
Concurrent  collectors  can  be  used  for  the  old  generation  of  generational  collectors.  However, 
employing  a  concurrent  collector  as  the  only  collector,  in  order  to  completely  remove  garbage 
collection  pauses,  may  not  be  acceptable  under  the  prevailing  circumstances  today,  viz.,  server 
systems  with  just  one  or  two  processors,  as  such  collectors  tend  to  reduce  throughput  significantly. 
Our  experimental  results  are  obtained  on  a  system  of  this  kind,  an  Apple  G5.  For  such  systems 
“stop-the-world”  garbage  collection  remains  a  viable  option  as  long  as  the  collection  pauses  are 
reasonably  short. 

Previously, we proposed an older-first garbage collector [SMM99], which differs from
generational collectors in that it does not always collect the youngest data along with the older data.
Like generational collectors, it relies only on relative object age, deduced from object position
in the heap, to decide which sets of objects to collect. As described, and as
implemented in the present paper, it does not take advantage of static analysis [HDH03] or
profiling-based heuristics [BSH+01, HSF03], though it could. In our earlier work, we demonstrated that an
emulation of this algorithm in a 32-bit address space can perform well [SHB+02]. Here,
for the first time, we have an implementation of the algorithm as originally envisaged, in a large
address space (except that special treatment of permanent data, "pretenuring", is still lacking). As
the results will show, this algorithm achieves excellent throughput, especially in
tight heaps, where it matters the most.


2  Background 

The conceptual design of the RealOF collector has been described fully elsewhere [SMM99,
Ste99]; here for completeness we sketch the main points. A traditional generational garbage
collector always collects a youngest subset of heap objects (i.e., some number of youngest generations).
An older-first collector, on the other hand, chooses a middle-aged subset of heap objects. Imagine
a heap logically laid out with objects in the order of their age: this is depicted in Figure 1, with
the oldest objects on the right, and the most recently allocated objects on the left. The older-first
collector chooses to collect a subset C, called the collection window, that is immediately to the left
of the survivors of the previous collection. The current collection's survivors S are left in place
(logically). After a collection, an amount of free space |C − S| is available for new allocation. In
this logical view, the heap remains laid out in age order, so the free space shows up on the far
left, where it is available for new allocation (green arrows). Thus, the window sweeps the heap
from older to younger, hence the name older-first. Initially, objects fill the entire heap and the
window is positioned at the old end of the heap. Eventually the window reaches the young end;
after collecting the young end of the heap, the window is reset to the old end.

1  We survey commercial 64-bit JVM platforms elsewhere [Kyr05].

Figure 1 shows a series of eight collections, and indicates how the window moves across the
heap when the collector is performing well. If the window is in a position that results in small
survivor set sizes |S| (Collections 4-8), the window moves by only that small amount from one
collection to the next. As the window moves slowly, it remains for a long time in the same logical
region. In a copying-collector implementation, this means that a great deal of allocation, |C − S|,
takes place with little copying work, |S|; in other words, that the performance is good. The reason
why this behavior might be expected to arise in some programs is that the position of the window
in the heap corresponds to object age, and it has been observed that object lifetimes tend to cluster
around a few dominant values. Whilst a generational collector takes advantage of the cluster
around zero ("most objects die young"), the older-first collector may take advantage of
middle-aged lifetime clusters.
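The relationship between window size, survivor size, and copying cost can be made concrete with a small model. The following is only a sketch under stated assumptions: the class and method names are hypothetical, sizes are in arbitrary units, and the reset rule (reset once the window no longer fits before the young end) is a simplification of the behavior described above.

```java
// Simplified model of older-first window motion in the logical (age-ordered) view.
// All names are illustrative; sizes are in arbitrary units (for example, MB).
final class OlderFirstWindowModel {

    /** Space released by one collection: window size minus survivors. */
    static long freedSpace(long windowSize, long survivors) {
        return windowSize - survivors;
    }

    /** Copying work per unit of allocation made possible: |S| / (|C| - |S|). */
    static double copyPerAllocatedUnit(long windowSize, long survivors) {
        long freed = freedSpace(windowSize, survivors);
        return freed <= 0 ? Double.POSITIVE_INFINITY : (double) survivors / freed;
    }

    /**
     * Offset of the window from the old end of the heap after each collection.
     * Survivors stay in place, so the next window starts just past them.
     * Assumption: the window is reset once it no longer fits before the young end.
     */
    static long[] windowOffsets(long heapSize, long windowSize, long[] survivorsPerGc) {
        long[] offsets = new long[survivorsPerGc.length];
        long offset = 0;
        for (int i = 0; i < survivorsPerGc.length; i++) {
            offset += survivorsPerGc[i];
            if (offset + windowSize > heapSize) {
                offset = 0; // reached the young end: start again at the old end
            }
            offsets[i] = offset;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = windowOffsets(100, 20, new long[] {15, 15, 2, 2});
        System.out.println(java.util.Arrays.toString(offsets));
        System.out.println(copyPerAllocatedUnit(20, 2));
    }
}
```

In this model, a window of 20 units with 15 survivors costs three units of copying per unit allocated, whereas the same window with 2 survivors costs roughly a ninth of a unit, which is the regime the text describes as performing well.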


Figure  1:  Older-first  window  motion  example. 

Like generational collectors, the older-first collector collects less than the entire heap each
time, and thus it must maintain a write barrier and remember certain pointer updates. The general
rule is that when a store creates a reference p → q, we need to remember it only if q might be
collected before p. Figure 2 illustrates this rule applied to the older-first collector. The crossed-out
pointers need not be remembered. It might seem complicated to apply this rule in a write
barrier. However, if a large address space is available, objects can be laid out in age order, as we
detail in Section 3. Allocation starts at the highest addresses in an allocation zone, and copying
proceeds into lower addresses in another zone. Once the allocation zone is exhausted, the copying zone
becomes the new allocation zone, and another chunk of address space is made into the new copying
zone. In a large address space we can do this for a very long time. Now the rule for the write
barrier filtering is little more than an address comparison, as shown in Figure 3, the same as in
efficient write barriers of generational collectors. Here for the first time we present a complete
implementation of the older-first collector in a large address space, which we now label RealOF.
Previously we reported a 32-bit implementation that, in the absence of a large address space,
resorted to indirection through an age lookup table in order to resolve write barriers [SHB+02];
we label it OF in the present paper.


Figure  2:  Directional  filtering  of  pointer  stores:  crossed-out  pointers  need  not  be  remembered. 

Figure 3: Directional filtering with an address-ordered heap.


3  Implementation 

3.1  Infrastructure 

Our implementation framework is the Jikes Research Virtual Machine (Jikes RVM), developed
by IBM Research [AAB+00, AAB+99], an open-source virtual machine capable of running a
wide variety of Java programs. It offers two compilers, baseline and optimizing, but uses no
interpreter. We built our infrastructure in stages, as follows. We ported Jikes RVM version 2.0.3,
with the baseline compiler alone, to the 64-bit PowerPC/AIX platform [Kyr05, KSM04], and then
we extended this port to the PowerPC/Linux architecture and specifically the Apple G5. This gave
us the only 64-bit open-source virtual machine (at the time) that provided a flexible test-bed for
implementing new memory management algorithms, owing to the presence in Jikes RVM of the
easily pluggable Garbage Collection Toolkit GCTk (now MMTk [BCM04]).² The GCTk already
contained fast implementations of generational copying collection algorithms, as well as of the
OF and Beltway collectors; to this we added our implementation of the RealOF collector (and
allocator). Using this system, we obtained encouraging preliminary performance results [KS05].
We then ported the Jikes RVM optimizing compiler to the 64-bit PowerPC/Linux architecture.
It is in this system that the results we report below were obtained. (In the meantime, we have
contributed our work to the effort, led by Kris Venstermans of the University of Ghent and David
Grove of IBM, to port the newest version of Jikes RVM to 64-bit PowerPC/Linux.)

2  http://www.cs.umass.edu/~gctk/


3.2  Collector  and  allocator 


The implementation of the RealOF algorithm supports the notions of zones and windows as
described above, but they are of necessity discretized in size and tied into the functioning of the
allocator. A zone is a contiguous region of memory and the largest logical memory unit of the
RealOF collector. All zones are of equal, power-of-2 size (in our experiments, 8 GB), and are
allocated from higher to lower addresses in order to maintain the address-ordered heap. At any moment
in time the algorithm has two zones: the allocation zone and the copy zone. Newly created objects
are placed in the allocation zone, from higher addresses to lower. During a garbage collection,
survivors are placed in the copy zone, from higher addresses to lower.

A  zone  consists  of  a  number  of  windows.  A  window  is  a  contiguous,  power-of-2  size,  region 
of  memory,  smaller  than  a  zone,  allocated  within  a  particular  zone  from  higher  to  lower  addresses. 
In  our  implementation  a  window  is  the  smallest  unit  of  memory  allocation  and  deallocation.  Thus 
every  garbage  collection  increment  collects  exactly  one  window.  The  size  of  a  window  is  limited 
from below by the minimum size of mappable virtual memory, which in our operating system is
4 MB.
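Because zones and windows are power-of-2 sized, locating the zone or window that contains an address reduces to shifts. The sketch below illustrates this with the 8 GB zone size from the experiments and the 4 MB minimum window size; the class and method names are hypothetical, and it additionally assumes that zones and windows are aligned to their own size, which the text does not state explicitly.

```java
// Address arithmetic for power-of-2 zones and windows.
// Assumptions: zones and windows are aligned to their size; addresses are held
// in a long; the sizes are the 8 GB zone and 4 MB window mentioned in the text.
final class ZoneWindowLayout {
    static final int ZONE_SIZE_LOG = 33;   // 8 GB
    static final int WINDOW_SIZE_LOG = 22; // 4 MB

    /** Lowest address of the zone containing the given address. */
    static long zoneBase(long address) {
        return (address >>> ZONE_SIZE_LOG) << ZONE_SIZE_LOG;
    }

    /** Lowest address of the window containing the given address. */
    static long windowBase(long address) {
        return (address >>> WINDOW_SIZE_LOG) << WINDOW_SIZE_LOG;
    }

    /** Number of windows that fit in one zone. */
    static int windowsPerZone() {
        return 1 << (ZONE_SIZE_LOG - WINDOW_SIZE_LOG);
    }

    /** Index of the window within its zone, counted from the zone's low end. */
    static int windowIndexInZone(long address) {
        return (int) ((address - zoneBase(address)) >>> WINDOW_SIZE_LOG);
    }

    public static void main(String[] args) {
        long address = (5L << ZONE_SIZE_LOG) + (3L << WINDOW_SIZE_LOG) + 100;
        System.out.println(windowsPerZone());
        System.out.println(windowIndexInZone(address));
    }
}
```

With these sizes a zone holds 2048 windows, so even with the smallest window a zone covers far more address space than a typical heap occupies at any one time.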


The RealOF allocator is a relatively simple and fast bump-pointer allocator, attached to the
current allocation window. The allocator actually implements allocation in either direction, but
our experiments have shown, somewhat surprisingly, that there is no performance difference
between the two. If a particular architecture supports hardware data prefetching such as within cache
lines, consistent with a lower-to-higher access order, we might expect to suffer a performance hit
allocating objects from higher to lower addresses. In reality, there is none (on our PowerPC
system). Note that with either direction of allocation the object layout remains the same, with array
objects laid out from lower to higher addresses and scalar objects laid out from higher to lower
addresses [AAB+99], and the address access pattern of the object initialization sequence thus
remains unaffected. There is apparently no observable memory system effect spanning multiple
consecutively allocated objects.
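A downward bump-pointer allocator of the kind described can be sketched in a few lines. This is an illustration only, not the GCTk allocator: the class name, the 8-byte alignment, and the use of -1 to signal a full window are all assumptions made for the example.

```java
// Minimal bump-pointer allocator that hands out addresses from the top of a
// window downward. Names, the 8-byte alignment, and the -1 failure value are
// assumptions made for this sketch.
final class DownwardBumpAllocator {
    private final long windowBase; // lowest address of the window
    private long cursor;           // lowest address handed out so far

    DownwardBumpAllocator(long windowBase, long windowSize) {
        this.windowBase = windowBase;
        this.cursor = windowBase + windowSize; // start at the high end
    }

    /** Returns the address of the new object, or -1 if the window is full. */
    long allocate(long bytes) {
        long aligned = (bytes + 7) & ~7L; // round up to a multiple of 8
        if (aligned <= 0 || cursor - windowBase < aligned) {
            return -1L; // caller would move on to a fresh window
        }
        cursor -= aligned;
        return cursor;
    }

    /** Bytes still available in this window. */
    long remaining() {
        return cursor - windowBase;
    }

    public static void main(String[] args) {
        DownwardBumpAllocator allocator = new DownwardBumpAllocator(1L << 22, 1L << 22);
        System.out.println(allocator.allocate(24));
        System.out.println(allocator.remaining());
    }
}
```

Allocating in the opposite direction only changes the sign of the bump and the bound check, which is consistent with the observation that the two directions perform alike.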

We illustrate the progress of the algorithm in Figure 4. At the onset of virtual machine
execution, both the allocation and the copy zone are empty. We allocate the very first window inside the
allocation  zone  and  start  placing  newly  created  objects  inside  this  window  using  our  simple  bump 
pointer  allocator.  When  the  first  window  fills  up,  we  allocate  another  window  inside  the  allocation 
zone  and  proceed  without  garbage  collection  (1).  The  first  garbage  collection  happens  when  the 
number  of  windows  in  the  heap  becomes  equal  to  the  maximum  allowed  number  of  windows: 


\[ \mathit{windows} = \frac{\mathit{heapsize}}{\mathit{windowsize}} - 1, \]
Figure 4: The RealOF algorithm shown operating with three windows in the heap (completely
or partially gray rectangles) and one copy reserve window (cr). Allocation proceeds from higher
addresses to lower (right to left); similarly, copying into the copy zone proceeds from higher
addresses to lower. Snapshots, top to bottom: (1) before any GC occurred; (2) after several GCs,
with most windows residing in the allocation zone; (3) after several more GCs, with most windows
residing in the copy zone; (4) right after a zone reset; (5) after several GCs following a zone reset,
with most windows residing in the allocation zone.


where  heapsize  is  rounded  down  to  fit  an  integer  number  of  windows  and  1  accounts  for  the  copy 
reserve  window.  The  maximum  number  of  windows  in  the  heap  is  maintained  as  an  invariant  of 
the  algorithm.  At  every  garbage  collection  increment,  we  collect  the  highest  (rightmost)  window 
in  the  heap,  copy  its  survivors  to  a  window  allocated  in  the  copy  zone,  and  deallocate  the  collected 
window  (2).  If  the  ratio  of  surviving  objects  is  relatively  high,  in  order  to  satisfy  the  allocation 
request  we  may  need  to  perform  several  garbage  collection  increments  and  collect  more  than  one 
window  in  the  allocation  zone,  creating  additional  windows  in  the  copy  zone. 
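The window budget can be checked with a short calculation. The sketch below uses hypothetical names and simply applies the formula: round the heap size down to a whole number of windows, then set one window aside as the copy reserve.

```java
// Maximum number of heap windows for a given heap size, following the formula
// windows = heapsize / windowsize - 1. Class and method names are illustrative.
final class WindowBudget {

    /** Heap size is rounded down to whole windows; one window is the copy reserve. */
    static long maxHeapWindows(long heapBytes, int windowSizeLog) {
        long wholeWindows = heapBytes >>> windowSizeLog; // integer division, rounds down
        return wholeWindows - 1;
    }

    /** A collection is due once the heap holds the maximum number of windows. */
    static boolean collectionDue(long windowsInHeap, long heapBytes, int windowSizeLog) {
        return windowsInHeap >= maxHeapWindows(heapBytes, windowSizeLog);
    }

    public static void main(String[] args) {
        // A 16 MB heap with 4 MB windows: three heap windows plus one copy reserve.
        System.out.println(maxHeapWindows(16L << 20, 22));
    }
}
```

A 16 MB heap with 4 MB windows yields three heap windows and one copy reserve window, which matches the configuration drawn in Figure 4.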

At  some  point,  the  number  of  windows  in  the  copy  zone  may  become  larger  than  the  number 
of  windows  in  the  allocation  zone  (3)  and  eventually  all  allowed  windows  may  end  up  in  the  copy 
zone.  This  situation  is  called  zone  reset.  When  the  zone  reset  occurs  we  rebind  the  current  copy 
zone  to  function  as  the  new  allocation  zone  and  create  a  new  copy  zone  right  below  the  old  one  (4). 
Note  that  here  we  have  a  situation  similar  to  the  one  before  the  first  garbage  collection,  when  all  the 
windows  reside  in  the  allocation  zone.  After  the  zone  reset  we  can  proceed  as  described  before  (5). 

Another  possible  situation  is  the  exhaustion  of  the  allocation  zone.  When  this  situation  is 
detected  we  perform  several  garbage  collection  increments  to  deallocate  all  windows  from  the 
allocation  zone,  which  in  turn  triggers  the  zone  reset  mechanism. 

In  the  unlikely  event  that  we  reach  the  bottom  of  available  address  space  we  perform  a  full-heap 
garbage  collection  and  move  all  live  data  back  to  the  highest  end  of  the  address  space.  We  describe 
full-heap  collection  below. 
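The zone bookkeeping described above can be sketched as follows. This is a simplified illustration with hypothetical names: it tracks only the two zone base addresses, takes the address-space bounds as explicit parameters, and assumes non-negative addresses so that signed arithmetic is safe. Window management and the copying itself are omitted.

```java
// Sketch of zone rotation. Zones are taken from the top of the address space
// downward. All names and the explicit address-space bounds are illustrative.
final class ZoneRotation {
    private final long zoneSize;
    private final long bottomOfAddressSpace;
    private long allocationZoneBase;
    private long copyZoneBase;

    ZoneRotation(long bottomOfAddressSpace, long topOfAddressSpace, long zoneSize) {
        this.zoneSize = zoneSize;
        this.bottomOfAddressSpace = bottomOfAddressSpace;
        this.allocationZoneBase = topOfAddressSpace - zoneSize;
        this.copyZoneBase = topOfAddressSpace - 2 * zoneSize;
    }

    long allocationZoneBase() {
        return allocationZoneBase;
    }

    long copyZoneBase() {
        return copyZoneBase;
    }

    /**
     * Zone reset: the copy zone becomes the allocation zone and a new copy zone
     * is placed directly below it. Returns false when no address space is left
     * below; a caller would then fall back to a full-heap collection.
     */
    boolean zoneReset() {
        if (copyZoneBase - bottomOfAddressSpace < zoneSize) {
            return false;
        }
        allocationZoneBase = copyZoneBase;
        copyZoneBase -= zoneSize;
        return true;
    }

    public static void main(String[] args) {
        long zone = 1L << 33; // 8 GB
        ZoneRotation rotation = new ZoneRotation(0L, 4 * zone, zone);
        System.out.println(rotation.zoneReset());
        System.out.println(rotation.allocationZoneBase() / zone);
    }
}
```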

3.2.1  Write  Barrier


By using the heap in address order we are able to use the same inexpensive write barrier as the one
used in traditional generational collectors [BM02], conceptually defined by this code:

    public static final void
    writeBarrier(ADDRESS source, ADDRESS target) {
      // Remember the store only if source lies below the base
      // of the window that contains target.
      if (source < ((target >>> WINDOW_SIZE_LOG)
                            << WINDOW_SIZE_LOG)) {
        GCTk_WriteBufferSlot.insert(source);
      }
    }


The optimizing compiler translates this Java code into an efficient three-instruction sequence
on the PowerPC: a mask, a compare, and a conditional branch [SHB+02]. As we show later, the
address-order write barrier gives a consistent performance improvement over the indirect write
barrier used in the previous implementation of the older-first algorithm [SHB+02], which performed
table lookups to map object addresses in a small address space to logical object ages.
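To see which stores the filter lets through, the condition can be written as a plain Java predicate over `long` addresses. This is a stand-alone sketch, not Jikes RVM code: the class name is hypothetical, the window size is the 4 MB minimum, and the unsigned comparison is an assumption made so that the example behaves sensibly for any 64-bit value.

```java
// The write barrier's filter condition as a stand-alone predicate.
// Assumptions: addresses are longs compared as unsigned values; 4 MB windows.
final class AddressOrderBarrier {
    static final int WINDOW_SIZE_LOG = 22;

    /** True if a store of a pointer to target into source must be remembered. */
    static boolean mustRemember(long source, long target) {
        long targetWindowBase = (target >>> WINDOW_SIZE_LOG) << WINDOW_SIZE_LOG;
        // Higher windows are collected first, so only pointers from below the
        // target's window can refer to something that is collected earlier.
        return Long.compareUnsigned(source, targetWindowBase) < 0;
    }

    public static void main(String[] args) {
        long window = 1L << WINDOW_SIZE_LOG;
        System.out.println(mustRemember(9 * window + 8, 10 * window + 16));
        System.out.println(mustRemember(10 * window + 8, 10 * window + 4096));
    }
}
```

Stores within one window and stores from a higher window into a lower one are filtered out. A store of a null (zero) target is filtered as well, because no source address is below zero in the unsigned comparison.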

3.2.2  Remembered  Sets 

We map windows to their corresponding remembered sets by extracting the remembered set
number directly from the target address:

    public static final void
    conditionalRemsetInsert(ADDRESS source, ADDRESS target) {
      if (source < ((target >>> WINDOW_SIZE_LOG)
                            << WINDOW_SIZE_LOG)) {
        // The remembered set number is the window number of target,
        // taken modulo the span of two zones.
        GCTk_RememberedSet.insert((int) ((target
            << (BITS_IN_ADDRESS - ZONE_SIZE_LOG - 1))
            >>> (BITS_IN_ADDRESS - ZONE_SIZE_LOG - 1 +
                 WINDOW_SIZE_LOG)), source);
      }
    }


In  order  for  this  approach  to  work,  the  number  of  remembered  sets  must  be  equal  to: 


\[ \mathit{remsets} = 2 \times \frac{\mathit{zonesize}}{\mathit{windowsize}} \]


In order to keep the overhead of remembered sets relatively low, it is beneficial to have as few
remembered sets as possible. This can be achieved by always keeping the window size as large as
possible for a particular heap size. On the other hand, having large windows hurts incrementality;
hence a large window may not always be a good solution. Another way to keep the overhead of the
remembered sets relatively low may be remembered-set triggered GC, wherein a GC increment is
performed if the size of the remembered sets reaches some upper threshold.
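The pair of shifts that extracts the remembered set number can be checked in isolation. The sketch below uses hypothetical names and assumes 64-bit addresses, 8 GB zones, and 4 MB windows. The left shift discards all but the low ZONE_SIZE_LOG + 1 address bits, and the right shift then drops the offset within the window, so the result ranges over two zones' worth of windows. Our reading, offered as an interpretation rather than a statement from the text, is that the factor of 2 lets the allocation zone and the adjacent copy zone map to disjoint remembered sets.

```java
// Remembered set number extracted from a target address.
// Assumptions: 64-bit addresses, 8 GB zones, 4 MB windows, size-aligned zones.
final class RemsetIndex {
    static final int BITS_IN_ADDRESS = 64;
    static final int ZONE_SIZE_LOG = 33;
    static final int WINDOW_SIZE_LOG = 22;

    /** Window number of target, modulo the span of two zones. */
    static int remsetNumber(long target) {
        int drop = BITS_IN_ADDRESS - ZONE_SIZE_LOG - 1;
        // The left shift discards the high bits; the unsigned right shift
        // brings the window-number bits down to the bottom.
        return (int) ((target << drop) >>> (drop + WINDOW_SIZE_LOG));
    }

    /** Number of remembered sets: 2 * zonesize / windowsize. */
    static int remsetCount() {
        return 2 << (ZONE_SIZE_LOG - WINDOW_SIZE_LOG);
    }

    public static void main(String[] args) {
        long inOddZone = (5L << ZONE_SIZE_LOG) + (3L << WINDOW_SIZE_LOG) + 100;
        long inEvenZone = (4L << ZONE_SIZE_LOG) + (3L << WINDOW_SIZE_LOG);
        System.out.println(remsetNumber(inOddZone));
        System.out.println(remsetNumber(inEvenZone));
        System.out.println(remsetCount());
    }
}
```

With these sizes there are 4096 remembered sets. Window 3 of an even-numbered zone maps to set 3, while window 3 of the neighboring odd-numbered zone maps to set 2051, so windows in adjacent zones never share a set.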


3.3  Full  Heap  Collection 

We implemented a full-heap collection mechanism in order to make the RealOF algorithm
complete. The mechanism is in principle necessary when the algorithm runs out of address space and
needs to move all data from the lowest-addressed zone back to the top of the address space; this,
however, will happen only in extremely long-running programs. The mechanism can also be
invoked to honor System.gc() hints from the application. Lastly, it may be invoked adaptively, if
high survival rates suggest the presence of garbage cycles spanning multiple windows. However,
we do not have adaptive heuristics worked out yet, none of the tested programs comes close to
exhausting the address space, and, for fair comparison, we ignore System.gc() hints, as they are
ignored by the remaining collectors implemented in GCTk. (If we do honor them, the minimum
heap size requirement for the RealOF algorithm executing the _213_javac benchmark decreases
from 88 MB to 56 MB, compared with 64 MB for both Appel-style generational and Beltway
collectors.)

In a system with the full-heap collection mechanism we use a third type of zone called the
reserved zone (Figure 5). The reserved zone serves as a place-holder for a temporary copy zone in
the event of full-heap collection, and following full-heap garbage collection it becomes the new
allocation zone. At other times it is neither used nor mapped into the address space of the process.




Figure  5:  Full  heap  collection  in  the  RealOF  algorithm  with  three  windows  in  the  heap  (completely 
or  partially  gray  rectangles)  and  one  copy  reserve  window  (cr).  Snapshots,  top  to  bottom:  (1) 
before  any  GC  occurred;  (2)  after  several  GCs,  with  most  windows  residing  in  the  allocation  zone; 
(3)  after  several  more  GCs,  with  most  windows  residing  in  the  copy  zone;  (4)  right  after  a  full  heap 
collection;  (5)  after  several  GCs  following  a  full  heap  collection,  with  most  windows  residing  in 
the  allocation  zone. 

4  Results


4.1  Experimental  Setting 

We  use  the  Jikes  RVM  optimizing  compiler  both  to  build  the  boot  image  (the  virtual  machine 
itself)  and  to  compile  the  application  code  at  run-time.  (Note  that  run-time  compilation  activity  is 
included in the reported results.) We use the so-called Fast configuration, which skips assertion
checks and pre-compiles all the classes of the virtual machine into the boot image. Our hardware
platform  is  an  Apple  G5,  with  a  PowerPC  970  processor  at  2  GHz  and  2  GB  of  memory,  running 
an  early-beta  version  of  the  64-bit  Yellow  Dog  Linux  3.0.1  for  G5  with  the  2.6.1  kernel. 

Our benchmark programs are the GC-relevant programs from SPECjvm98 [Sta99, DH98] and
SPECjbb2000 [Sta01]. Some characteristics of the benchmarks are summarized in Table 1. We
assume  that  SPECjvm98  programs  are  representative  of  short-running  client  applications,  whereas 
SPECjbb2000  is  representative  of  server  applications. 


                                                              Minimum Heap      Maximum Heap      Total
Benchmark         Description                                 Size, MB   GCs    Size, MB   GCs    Allocation, MB
SPECjvm98 jess    Java Expert System Shell                          24   443         144    17    956
SPECjvm98 db      A database simulation program                     40   150          96    16    386
SPECjvm98 javac   Java compiler from JDK 1.0.2                      64   227         256    15    1365
SPECjvm98 jack    A Java parser generator                           32   347         160    16    930
SPECjbb2000 - 1   Emulates a 3-tier system with 1 warehouse         96   862         640    28    4928-6292
SPECjbb2000 - 2   Emulates a 3-tier system with 2 warehouses       128   926         640    42    4011-6039
SPECjbb2000 - 4   Emulates a 3-tier system with 4 warehouses       196   783         640    49    3732-5707
SPECjbb2000 - 8   Emulates a 3-tier system with 8 warehouses       352   543         640    83    3785-5324

Table 1: Benchmark information including the number of garbage collections performed by the
Appel-style collector.

For  the  SPECjvm98  benchmarks,  our  performance  metric  is  the  running  time  of  the  program. 
For  SPECjbb2000,  however,  the  SPEC  benchmarking  procedure  fixes  the  running  time,  and  the 
benchmark  itself  reports  the  measured  throughput  as  the  number  of  transactions  per  second,  so 
this is our performance metric. (Because the amount of useful work varies with the efficiency
of collection, and in turn with heap size, the Total Allocation column of Table 1 contains ranges of
values in the SPECjbb2000 rows.)

The garbage collectors compared are an Appel-style two-generation collector [App89]
(labelled 2G in plots); the Beltway collector [BJMM02] in its default 25.25.100 configuration; the
older-first collector, with the indirect write barrier and a window size of 25% of the heap [SHB+02]
(labelled OF in plots); and the RealOF collector, described in Section 3.

We ran each benchmark and each collector with a wide range of heap sizes, taking care to include
the smallest heaps in which the programs are able to complete. Each such configuration was run
three times, and for final results we report the best run. (We intend to repeat the experiments and
get a proper characterization of variance; it appears to be small.)

4.2  Running  Time  and  Throughput 

In Figure 6 we show the total execution time of the several garbage collection algorithms as the
heap size is varied. The Appel-style collector is known to provide excellent throughput (i.e., to
have low GC overhead measured as a fraction of total execution time), and therefore we also provide
Figure 7, in which the total execution times of each collector are divided by the total execution time
of the Appel-style collector.

Consistent  with  our  expectations,  using  a  fast  write  barrier  makes  RealOF  uniformly  faster 
than  OF  across  all  benchmarks.  We  now  examine  how  RealOF  behaves  for  different  benchmarks. 
Somewhat surprisingly, in some cases (notably SPECjvm98 jess) after the heap size becomes
relatively large for a particular benchmark, the performance of RealOF begins to decrease slightly.
We  have  determined  that  this  happens  because  after  some  point  the  cost  of  processing  increasingly 
large  numbers  of  remembered  pointers  outweighs  the  benefits  of  a  larger  heap  (and,  with  a  fixed 
window  size,  the  total  number  of  pointers  remembered  for  all  windows  grows  in  rough  proportion 
to  the  number  of  windows,  i.e.,  heap  size). 

From the measurements of RealOF with different window sizes in SPECjvm98, we conclude
that, in general, a larger window size leads to better performance, provided the larger window size is
feasible. There are two reasons for this. First, with smaller window sizes we have to invoke garbage
collection more frequently, and the total cost of several smaller garbage collections is
at least as high as the cost of one larger collection.³ Second, by collecting
a bigger window we are able to free more space. Since one bigger window encompasses two,
four, etc. smaller windows, some inter-window pointers (a burden on remembered sets) turn into
intra-window pointers (no cost), resulting in diminished pointer processing time and a reduction in
garbage unnecessarily retained. Indeed, we find that having four or five windows in the heap gives
the best results, consistent with the 15-25% estimates for the optimal window-to-heap ratio from
our previous work.
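
The conversion of inter-window pointers into intra-window pointers can be illustrated with a small sketch. It assumes power-of-two window sizes aligned to their own size, so that a window index is the address shifted right by the logarithm of the window size. The function names and the pointer addresses are hypothetical and serve only as illustration; they are not measured data. The sketch also counts every pointer that crosses a window boundary, which may overstate what the actual collector needs to remember.

```python
def window_index(address: int, log2_window_size: int) -> int:
    """Index of the window containing `address` (power-of-two window size)."""
    return address >> log2_window_size


def count_inter_window(pointers, log2_window_size: int) -> int:
    """Count pointers whose source and target lie in different windows."""
    return sum(
        1
        for source, target in pointers
        if window_index(source, log2_window_size)
        != window_index(target, log2_window_size)
    )


# Hypothetical (source, target) address pairs in a 64 KB heap; not measured data.
POINTERS = [
    (0x0100, 0x0F00),
    (0x0800, 0x1800),
    (0x1000, 0x3000),
    (0x2000, 0x9000),
    (0x4000, 0xC000),
]

# Window sizes of 4 KB, 8 KB, 16 KB and 32 KB (shift amounts 12 to 15).
counts = {shift: count_inter_window(POINTERS, shift) for shift in (12, 13, 14, 15)}
```

Because each doubling merges pairs of adjacent windows, a pointer that is intra-window at one size stays intra-window at every larger size, so the count can only stay the same or fall as the window grows.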

Thus,  it  appears  that  in  some  cases  it  would  be  beneficial  to  have  a  simple  adaptive  window 
resizing  mechanism.  It  would  be  responsible  for  setting  an  initial  window  size  for  a  given  heap  size, 
and  for  switching  to  the  largest  possible  window  size  during  execution  when  the  heap  is  resized 
dynamically.  This  incurs  the  one-time  cost  of  reorganizing  the  remembered  sets,  which  could  be 
piggybacked onto a garbage collection, and the cost of adaptively changing the write barrier code.⁴
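
A minimal sketch of such a mechanism follows, under stated assumptions. The policy of keeping at least four windows in the heap is taken from the observation above that four or five windows work best; the names `largest_window_shift`, `WriteBarrier`, and `min_windows` are hypothetical. The barrier filter is deliberately simplified to "remember any pointer that crosses a window boundary", which is not necessarily the filter RealOF uses. The shift amount is held in a field rather than fixed in the code, mirroring the variable-amount shift alternative.

```python
def largest_window_shift(heap_bytes: int, min_windows: int = 4) -> int:
    """log2 of the largest power-of-two window leaving >= min_windows windows."""
    if heap_bytes < min_windows:
        raise ValueError("heap too small for the requested number of windows")
    return (heap_bytes // min_windows).bit_length() - 1


class WriteBarrier:
    """Toy barrier whose shift amount is data, not part of the barrier code."""

    def __init__(self, heap_bytes: int):
        self.shift = largest_window_shift(heap_bytes)
        self.remembered = set()

    def record_store(self, source: int, target: int) -> None:
        # Simplified filter (an assumption): remember boundary-crossing pointers.
        if (source >> self.shift) != (target >> self.shift):
            self.remembered.add((source, target))

    def resize_heap(self, heap_bytes: int) -> None:
        """Adopt the new window size and reorganize the remembered set once.

        Re-filtering is sufficient only when the window grows; shrinking the
        window would require rediscovering pointers that were never recorded.
        """
        self.shift = largest_window_shift(heap_bytes)
        self.remembered = {
            (s, t)
            for (s, t) in self.remembered
            if (s >> self.shift) != (t >> self.shift)
        }
```

For example, a barrier created for a 64 KB heap uses 16 KB windows; after the heap grows to 128 KB, `resize_heap` switches to 32 KB windows and drops the remembered pointers that no longer cross a window boundary.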

Overall, other than for SPECjvm98 jess, there are configurations of RealOF that are either
better than or about as good as the Appel-style generational collector. We have separately carried out
heap profiling studies of SPECjvm98 jess using exact tracing [HBM+02], and we know that the
large amounts of permanent data allocated in that program hurt the performance of RealOF.

³ Here we are not concerned with pause times but only with collector throughput performance.

⁴ In the Jikes RVM baseline compiler, this is simply a matter of recompiling one method; in the optimizing compiler,
which normally inlines the write barrier, we must either recompile all methods (unwieldy and expensive) or code the
write barrier using variable-amount shift instructions instead of fixed-amount shift instructions (both available on the
PowerPC) and dedicate a register to hold the shift amount, i.e., the logarithm of window size.

Figure 6: Absolute execution time for SPECjvm98 benchmarks for different garbage collectors.

Figure 7: Relative execution time for SPECjvm98 benchmarks for different garbage collectors
(shown relative to the Appel-style collector, 2G). Lower is better.

Turning to the SPECjbb2000 runs (Figure 8 and Figure 9), we see that in tight heaps the
RealOF algorithm tends to have the highest throughput among all collectors tested. Not only
does RealOF significantly decrease the minimum heap requirements (Figure 8), especially in the 8
“warehouses” case (224 MB with RealOF vs. 352 MB with Appel-style and Beltway), but it also
provides a large improvement in throughput (Figure 9), up to a factor of 2.4 in the 4 “warehouses”
configuration.
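
To put the minimum heap figures in relative terms, the arithmetic can be worked through as below. Both sizes are the ones quoted in the text for the 8-warehouse case; the variable names are ours.

```python
# Minimum heap sizes quoted in the text for the 8-warehouse SPECjbb2000 runs.
REALOF_MIN_HEAP_MB = 224
APPEL_MIN_HEAP_MB = 352  # the same figure is quoted for Beltway

saved_mb = APPEL_MIN_HEAP_MB - REALOF_MIN_HEAP_MB
reduction = saved_mb / APPEL_MIN_HEAP_MB

summary = f"RealOF runs in {saved_mb} MB less, a {reduction:.1%} smaller minimum heap"
```

That is, RealOF completes the 8-warehouse run in a heap 128 MB smaller, roughly 36% below the minimum needed by the other two collectors.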

The overall performance benefit of the RealOF collector over the other collectors tends to
increase slightly with the number of “warehouses”, so that in the SPEC-standard configuration with 8
warehouses the RealOF collector performs better than Appel-style across the whole range of heap sizes.

In  summary,  RealOF  shows  performance  competitive  with  generational  collection  on  observed 
client-side  programs,  and  significantly  better  on  server-side  programs. 


5  Concluding  remarks 

Our  results  demonstrate  that  for  Java  server  applications,  a  large  address  space  with  an  equitable, 
fast write barrier confers a clear performance advantage on the older-first algorithm over traditional
generational  collectors.  Importantly,  the  advantage  is  most  pronounced  for  small  heap  sizes. 

In  previous  work  we  showed  that  the  distribution  of  pause  times  incurred  by  the  OF  collector  is 
favorable [SHB+02]. The differences introduced in the RealOF implementation should not affect
the  pause  time  distribution,  but  this  remains  to  be  confirmed  experimentally. 

However, remembered set maintenance and permanent data remain potential weak spots of
the algorithm that can hurt its performance on some programs. Therefore, in our current and
future work (using MMTk, the successor to GCTk, included in newer releases of Jikes RVM) we
are investigating remembered set triggers and hybrid models in which the basic idea of the
algorithm will be combined with recent advances in object lifetime prediction and allocation-time or
collection-time pretenuring.